
PR: Refine ggml-qnn backend (QNN, Qualcomm Neural Network, aka Qualcomm AI Engine Direct) for latest ggml, whisper.cpp, llama.cpp #12049

Open · zhouwg wants to merge 30 commits into base: master

Conversation

@zhouwg (Contributor) commented Feb 24, 2025

PR Description

This PR is a continuation of my original PR #6869.

Thanks to the major changes in the software architecture of the latest llama.cpp (especially the maturation of the "backend scheduler" feature):

  • the data path of the ggml-qnn backend works as expected with whisper.cpp and llama.cpp.
  • the official command-line tools "test-backend-ops" and "llama-cli" have been verified on an Android phone equipped with a Qualcomm Snapdragon 8 Gen 3.
  • ASR inference via whisper.cpp and LLM inference via llama.cpp work well in a standard Android app (a self-made, fairly complex one) on a Qualcomm Snapdragon 8 Gen 3 equipped Android phone.

This implementation puts the main logic in a single source file (ggml-qnn.cpp), because that makes it easier for other experienced programmers to get involved in development, similar to what ggerganov did at the very beginning of ggml.c/llama.cpp, what Intel did at the very beginning of ggml-sycl.cpp, and what Qualcomm did in ggml-opencl.cpp. Another reason for this coding style is that I think it keeps the developers' workflow simple:

  • first overcome all relevant technical issues/difficulties with one specific op, GGML_OP_ADD or GGML_OP_MUL_MAT (see the sketch after this list)
  • then extend to other ggml ops accordingly, with teamwork from AI experts in the upstream llama.cpp community
  • the last step is code restructuring via more elaborate C++ encapsulation for a real project or a specific need (this is not mandatory)
  • cross-validation and internal development activities stay convenient, because only a single source file needs to be replaced
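
To make the per-op workflow concrete, below is a hedged, illustrative sketch of what offloading a single GGML_OP_ADD to QNN roughly looks like. It is not the code in this PR: qnn_wrap_ggml_tensor() is a hypothetical helper that would create a QNN graph tensor from a ggml_tensor via tensorCreateGraphTensor(), and the "ElementWiseAdd" op-type string should be checked against the QNN SDK op definitions; only the overall flow (create graph, add one node, finalize, execute) reflects the approach described here.

/* Hedged sketch only; qnn_wrap_ggml_tensor() is a hypothetical helper, not code from this PR. */
#include "ggml.h"
#include "QnnInterface.h"

extern Qnn_Tensor_t qnn_wrap_ggml_tensor(QNN_INTERFACE_VER_TYPE iface,
                                         Qnn_GraphHandle_t graph,
                                         const struct ggml_tensor * t);   /* hypothetical */

static int ggml_qnn_add_sketch(QNN_INTERFACE_VER_TYPE iface, Qnn_ContextHandle_t ctx,
                               const struct ggml_tensor * src0,
                               const struct ggml_tensor * src1,
                               struct ggml_tensor * dst) {
    Qnn_GraphHandle_t graph = NULL;
    const QnnGraph_Config_t * graph_cfg[] = { NULL };
    if (iface.graphCreate(ctx, "ggml_op_add", graph_cfg, &graph) != 0) {
        return -1;                                   /* fall back to the CPU backend */
    }

    Qnn_Tensor_t in0 = qnn_wrap_ggml_tensor(iface, graph, src0);
    Qnn_Tensor_t in1 = qnn_wrap_ggml_tensor(iface, graph, src1);
    Qnn_Tensor_t out = qnn_wrap_ggml_tensor(iface, graph, dst);

    Qnn_Tensor_t inputs[]  = { in0, in1 };
    Qnn_Tensor_t outputs[] = { out };
    Qnn_OpConfig_t op = {
        (Qnn_OpConfigVersion_t) 1,
        .v1 = { "ggml_add_node", "qti.aisw", "ElementWiseAdd", /* op type: check QNN op defs */
                0, NULL,          /* no op params for a plain element-wise add */
                2, inputs,
                1, outputs }
    };
    if (iface.graphAddNode(graph, op) != 0)          return -1;
    if (iface.graphFinalize(graph, NULL, NULL) != 0) return -1;   /* "compile" once */
    /* execute: the wrapped dst tensor's client buffer is assumed to point at dst->data */
    return iface.graphExecute(graph, inputs, 2, outputs, 1, NULL, NULL) == 0 ? 0 : -1;
}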

This is a concise implementation focused on the final mission of "how to utilize the Hexagon NPU maximally". It is not fancy (it lacks modern C++ features such as lambda expressions and elaborate C++ encapsulation), but I hope every domain programmer can understand the code and the domain-specific technical details easily and quickly. It can be considered an open-source reference implementation of ggml-qnn, and it can/might also be used in a production project. I think this PR will be helpful to the llama.cpp community even if it is not accepted.

After spending a great deal of effort on the ggml-qnn backend, I personally think a fully open-source ggml-qnn backend has to be a team effort between experienced software programmers and AI experts, possibly with professional technical help from Qualcomm. In other words, I personally think no single independent programmer or development team can deliver a complete ggml-qnn backend: it is very rare to find one person who is familiar with both Android and Windows system software programming, proficient in hard-core AI technology and the Qualcomm QNN SDK, and, just as importantly, familiar with the ggml/llama.cpp source code.

Big picture of ggml-qnn backend

Please refer to the simple tech doc below: mapping ggml compute graph to QNN compute graph.

The first technical approach can be seen in this PR. The second technical approach can then be built on top of this PR, either with the same coding style or with more elaborate C++ encapsulation.

What I did in my first PR and this PR

  • the data path between the QNN SDK and ggml/llama.cpp works well, built through reverse engineering of ExecuTorch (the QNN backend implementation in ExecuTorch comes from Qualcomm). I personally think this has been the first open-source implementation of ggml-qnn in the llama.cpp community since 04/2024 (corrections are greatly welcomed if this statement is inaccurate)
  • a graph cache mechanism (see the sketch after this list)
  • a simple skeleton in function ggml_qnn_add: offload GGML_OP_ADD to the QNN backend
  • a complex skeleton in function ggml_qnn_mulmat: offload GGML_OP_MUL_MAT to the QNN backend; this skeleton can also be used to illustrate the second technical approach to "how to utilize the Hexagon NPU maximally"
  • the QNN NPU RPC feature (UT passed, but some unknown bugs remain to be fixed; they should also be present in all hard-forked ggml-qnn projects, and they are not intentional)
  • a big picture and different technical approaches for ggml-qnn, in my forked llama.cpp and in this PR
  • the necessary technical difficulties in this PR have been overcome
  • quantized data types with 2D mulmat and a very significant performance improvement for LLM inference with the ggml-qnn backend (added on 02/26/2025, 12:40; passed UT and CT through test-backend-ops and llama-cli)
  • a concise implementation without complex encapsulation. The code is simple and easy to understand, because I believe in the "simple is beautiful" philosophy that comes from Unix and can also be found in the great llama.cpp. I personally think that is one of the key reasons why ggml/whisper.cpp/llama.cpp are so popular and so successful: TensorFlow Lite from Google, ExecuTorch from Meta, MNN from Alibaba, and MACE from Xiaomi all exist and are all good, yet many programmers, IT giants, startups, and research institutions still prefer ggml for on-device AI scenarios. Other experienced programmers can later restructure the code with more elaborate C++ wrappers/encapsulation for a real product, although the core ideas and technical difficulties remain the same as in this implementation.
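
As a small illustration of the graph cache item above, here is a hedged sketch of the idea; the names qnn_graph_cache_entry, qnn_graph_cache_lookup and qnn_graph_cache_insert are hypothetical and not taken from this PR. The point is only that graphCreate/graphFinalize run once per unique (op, shape) configuration and the finalized Qnn_GraphHandle_t is reused on every subsequent token.

/* Hypothetical sketch of a graph cache; not the code in this PR.
 * Key idea: finalize a QNN graph once per (op, shape) configuration and reuse it. */
#include <stdio.h>
#include <string.h>
#include "QnnInterface.h"

#define QNN_GRAPH_CACHE_SIZE 256

struct qnn_graph_cache_entry {
    char              key[128];   /* e.g. "MUL_MAT_4096x4096_f16" */
    Qnn_GraphHandle_t graph;      /* finalized graph handle       */
    int               valid;
};

static struct qnn_graph_cache_entry s_graph_cache[QNN_GRAPH_CACHE_SIZE];

static Qnn_GraphHandle_t qnn_graph_cache_lookup(const char * key) {
    for (int i = 0; i < QNN_GRAPH_CACHE_SIZE; i++) {
        if (s_graph_cache[i].valid && strcmp(s_graph_cache[i].key, key) == 0) {
            return s_graph_cache[i].graph;   /* cache hit: skip graphCreate/graphFinalize */
        }
    }
    return NULL;                             /* cache miss: caller builds, finalizes, inserts */
}

static void qnn_graph_cache_insert(const char * key, Qnn_GraphHandle_t graph) {
    for (int i = 0; i < QNN_GRAPH_CACHE_SIZE; i++) {
        if (!s_graph_cache[i].valid) {
            snprintf(s_graph_cache[i].key, sizeof(s_graph_cache[i].key), "%s", key);
            s_graph_cache[i].graph = graph;
            s_graph_cache[i].valid = 1;
            return;
        }
    }
    /* cache full: a real implementation would evict or fall back to rebuilding */
}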

All of the above can be found in the KanTV project unless otherwise noted. KanTV is a device-AI learning project that depends heavily on ggml/whisper.cpp/llama.cpp. I personally think the remaining parts of ggml-qnn are a team effort between AI experts and experienced programmers; we can work together toward the final goal of providing a production-level open-source ggml-qnn backend to the llama.cpp community.

Performance of ggml-qnn backend

Performance is the key point I always focus on for Android and other embedded systems. Below are results from my local dev environment showing how I addressed the performance issue at this early phase of the ggml-qnn backend (it still lacks many ops, although it is a genuinely functional backend):

Before fine-tuning:

Fig-1: llama-bench with the QNN NPU backend (Hexagon NPU) on a Xiaomi 14
Screenshot from 2025-02-22 11-43-52

After fine-tuning:
Fig-2: llama-bench with the QNN NPU backend (Hexagon NPU) on a Xiaomi 14; some mulmat operations have been offloaded to the NPU
Screenshot from 2025-02-22 11-30-39

Fig-3: llama-bench with the ggml CPU backend ("3" is a "fake" ggml-qnn backend used to compare performance between the QNN NPU backend and the ggml CPU backend) on a Xiaomi 14 (AI experts may be able to explain why the second number differs so much between Fig-2 and Fig-3; that would be helpful for performance fine-tuning in this PR)
Screenshot from 2025-02-22 12-15-08
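
For reference, the kind of invocation behind these llama-bench screenshots looks roughly like the following. This is a hedged sketch only: the model file name and on-device paths are placeholders, and the build-run-android.sh script in this PR wraps the real commands.

  adb push build-android/bin/llama-bench /data/local/tmp/
  adb push <model>.gguf /data/local/tmp/
  adb shell "cd /data/local/tmp && LD_LIBRARY_PATH=/data/local/tmp ./llama-bench -m <model>.gguf -ngl 99"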

How to set up the dev environment on a Linux machine

Ubuntu 20.04 and 22.04 are validated and recommended; other Linux distributions might also work.

The development in this PR can be done entirely from the command line, without any IDE, so setting up the dev environment is simple:

How to build the ggml-qnn source code for Android and verify the ggml-qnn backend on a Snapdragon-based phone

  git clone https://github.com/kantv-ai/ggml-qnn
  cd ggml-qnn
  git checkout build_fix
  ./scripts/build-run-android.sh build          (sets up the local build environment automatically and builds the entire project)
  ./scripts/build-run-android.sh updateqnnlib   (uploads Qualcomm's QNN binary runtime libs to the Android phone)
  ./scripts/build-run-android.sh run_llamacli   (runs llama-cli on the Android phone)
  ./scripts/build-run-android.sh run_testop     (runs test-backend-ops on the Android phone)

The log output of "adb logcat | grep ggml-qnn" confirms that this backend works as expected; the same command is also useful for troubleshooting.

How to build the ggml-qnn source code for a Snapdragon-based WoA (Windows on ARM) device

I have no knowledge of Windows programming, although the source code of ggml/llama.cpp is portable and Qualcomm's well-designed QNN SDK is portable as well.

I know @chraac's team has already done this work in their fork in this community. I think it can be migrated to this PR with help from his team, or by me manually after I carefully study their Windows port.

Acknowledgement

  1. This implementation of ggml-qnn is mainly ported/reverse-engineered from ExecuTorch (the QNN backend implementation in ExecuTorch comes from Qualcomm). This is also one of the key reasons why I personally disagree with another implementation built around complex C++ encapsulation, although such encapsulation does have some real advantages.
  2. I got breakthrough help from chiwwang@Qualcomm Technologies Inc in 04/2024.
  3. I also got meaningful help from XiaoMi-StableDiffusionOnDevice in 05/2024.
  4. Thanks to @chraac's team: I borrowed 5-7 functions from their implementation. There are many people on that team and they have made good progress on the Windows port and on 4D mulmat. I have had very unpleasant interactions with this programmer, but I still hope we can work together to provide a production-level open-source ggml-qnn backend to the great llama.cpp community, drop pointless arguments (such as OO principles via complex C++ encapsulation versus arrays of function pointers in C; we are all experienced programmers and we know such arguments are meaningless, even though I have run into them many times in my career), avoid meaningless technical competition, and follow some basic rules: verify and validate a PR before leaving technical comments, follow the coding style of this PR, do not leave meaningless comments in this PR, do not reference a standalone PR in this PR (although I wish it success in this community), do not answer technical questions on behalf of the PR author unless the author actively seeks help, and do not challenge the PR author intentionally. One more important thing: I hope the code reviewer is a real domain expert rather than someone who challenges the PR author intentionally. I personally do not leave code-review comments on other people's PRs, even when I know something about the topic, because I know I am not a domain expert on that topic and I fully understand that we will not agree on everything. This community is also outside Mainland China, and I hope we do not bring certain habits from there into it. Of course, they can continue their own way and I wish them success in this community. Thanks so much!

AI-assisted programming for ggml-qnn backend

Recently I tried AI-assisted programming for the ggml-qnn backend with the help of the powerful Grok 3 (I tried DeepSeek-R1 but failed; I personally think Grok 3 behaves like a natural human domain expert, and it really helped me a lot in this PR).

@ggerganov @slaren, thanks to your outstanding "backend scheduler" feature, this ggml-qnn backend is now a functional QNN backend, and LLM inference performance on the Snapdragon 8 Gen 3 is greatly improved (10x-13x). Please have a look when you have time. If this PR can be accepted, it will help other domain experts and AI experts get involved in development: the remaining parts of ggml-qnn involve a lot of QNN SDK API assembly and calls (whether in C or C++ style), and they are not easy; a real product will require a significant workload and effort, especially teamwork between AI experts and experienced programmers. I know Qualcomm's experts already participate in llama.cpp development; perhaps they can help with code review. Thanks so much.

@zhouwg zhouwg closed this Feb 24, 2025
@github-actions bot added the "script (Script related)" and "testing (Everything test related)" labels on Feb 24, 2025
@github-actions bot added the "ggml (changes relating to the ggml tensor library for machine learning)" label on Feb 24, 2025
@zhouwg zhouwg changed the title build: avoid ggml-qnn backend breaking other backend's builds PR: Refine ggml-qnn backend(QNN, Qualcomm Neural Network,aka Qualcomm AI Engine Direct) for latest ggml,whisper.cpp,llama.cpp Feb 24, 2025
@zhouwg zhouwg reopened this Feb 24, 2025
@zhouwg (Contributor, Author) commented Feb 24, 2025

a simple tech doc: mapping ggml compute graph to QNN compute graph

with breakthrough help from chiwwang@QTI in April 2024:

//==============================================================================
//
//  Copyright (c) 2020-2024 Qualcomm Technologies, Inc.
//  All Rights Reserved.
//  Confidential and Proprietary - Qualcomm Technologies, Inc.

//  saver_output.c is generated automatically by Qualcomm's dedicated tool
//
//  this customized saver_output.c is used to troubleshoot an issue in
//  PoC-S26: offload a simple f32 2x2 matrix addition operation to QNN CPU backend
//  https://github.com/zhouwg/kantv/issues/121
//
//==============================================================================

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include "QnnInterface.h"

#include "ggml-jni.h"

#define VALIDATE(res, value) \
   do { \
      if (res == 0 || res == QNN_COMMON_ERROR_NOT_SUPPORTED) \
      { \
         res = value; \
         if (res != 0) \
         { \
            if (res == QNN_COMMON_ERROR_NOT_SUPPORTED) \
            { \
               LOGGD("WARNING! Line %d QNN feature/API not supported\n", __LINE__); \
               GGML_JNI_NOTIFY("WARNING! Line %d QNN feature/API not supported\n", __LINE__); \
            } else { \
               LOGGD("ERROR! Line %d with error value: %d\n", __LINE__, (unsigned int)error); \
            } \
         } \
      } \
   } \
   while(0)


static void qnn_saver_logcallback(const char* fmt,
                                 QnnLog_Level_t level,
                                 uint64_t timestamp,
                                 va_list argp) {

    static unsigned char s_qnn_saver_buf[JNI_BUF_LEN];

    const char * levelStr = "";
    switch (level) {
        case QNN_LOG_LEVEL_ERROR:
            levelStr = " ERROR ";
            break;
        case QNN_LOG_LEVEL_WARN:
            levelStr = "WARNING";
            break;
        case QNN_LOG_LEVEL_INFO:
            levelStr = "  INFO ";
            break;
        case QNN_LOG_LEVEL_DEBUG:
            levelStr = " DEBUG ";
            break;
        case QNN_LOG_LEVEL_VERBOSE:
            levelStr = "VERBOSE";
            break;
        case QNN_LOG_LEVEL_MAX:
            levelStr = "UNKNOWN";
            break;
    }

    double ms = (double)timestamp / 1000000.0;

    {
        int len_content = 0;
        memset(s_qnn_saver_buf, 0, JNI_BUF_LEN);
        len_content = vsnprintf((char *) s_qnn_saver_buf, JNI_BUF_LEN, fmt, argp);
        snprintf((char *) (s_qnn_saver_buf + len_content), JNI_BUF_LEN - len_content, "\n");
        LOGGD("%8.1fms [%-7s] %s ", ms, levelStr, s_qnn_saver_buf);
        //if (level <= QNN_LOG_LEVEL_INFO)
        {
            GGML_JNI_NOTIFY("%8.1fms [%-7s] %s ", ms, levelStr, s_qnn_saver_buf);
        }
    }
}

int qnn_saver_main(int argc, char **argv) {
    LOGGI("enter %s", __func__);
    GGML_JNI_NOTIFY("enter %s", __func__);
    Qnn_ErrorHandle_t error = 0;
    QnnLog_Level_t logLevel = QNN_LOG_LEVEL_VERBOSE;
    int logging = 1;
    for (int i = 1; i < argc; i++) {
        char *arg = argv[i];
        if (!strcmp("--logging", arg) || !strcmp("-l", arg)) {
            logging = 1;
            if (i + 1 == argc) {
                printf("No log level provided, defaulting to QNN_LOG_LEVEL_ERROR\n");
                break;
            }
            char *value = argv[++i];
            if (!strcmp("error", value)) {
                logLevel = QNN_LOG_LEVEL_ERROR;
            } else if (!strcmp("warn", value)) {
                logLevel = QNN_LOG_LEVEL_WARN;
            } else if (!strcmp("info", value)) {
                logLevel = QNN_LOG_LEVEL_INFO;
            } else if (!strcmp("debug", value)) {
                logLevel = QNN_LOG_LEVEL_DEBUG;
            } else if (!strcmp("verbose", value)) {
                logLevel = QNN_LOG_LEVEL_VERBOSE;
            } else {
                printf("WARNING: Unknown log level provided: %s, defaulting to QNN_LOG_LEVEL_ERROR\n",
                       value);
            }
        } else {
            printf("Usage: %s [options]\n\n"
                   "-l <level>, --logging <level>      Enable logging, acceptable levels are: error,warn,info,debug,verbose\n",
                   argv[0]);
            return -1;
        }
    }

    LOGGD("log level %d\n", logLevel);
    FILE *fp = fopen("/sdcard/kantv/params.bin", "rb");
    if (!fp) {
        error = -1;
        LOGGI("ERROR! Could not open params.bin, ensure this file is in the current working directory when executing this program\n");
        GGML_JNI_NOTIFY("ERROR! Could not open params.bin, ensure this file is in the current working directory when executing this program\n");
        return error;
    }

    const QnnInterface_t **providerList = NULL;
    uint32_t numProviders;
    VALIDATE(error, QnnInterface_getProviders(&providerList, &numProviders));
    LOGGD("numProviders %d\n", numProviders);
    GGML_JNI_NOTIFY("numProviders %d\n", numProviders);
    for (int idx = 0; idx < numProviders; idx++) {
        LOGGD("backend name %s\n", providerList[idx]->providerName);
        GGML_JNI_NOTIFY("backend name %s\n", providerList[idx]->providerName);
    }
    QNN_INTERFACE_VER_TYPE interface = providerList[0]->QNN_INTERFACE_VER_NAME;

    Qnn_LogHandle_t loghandle = NULL;
    if (logging) {
        VALIDATE(error, interface.logCreate(qnn_saver_logcallback, logLevel, &loghandle));
    }
    //VALIDATE(error, interface.propertyHasCapability((QnnProperty_Key_t) 304)); //QNN_PROPERTY_GRAPH_SUPPORT_NULL_INPUTS
    VALIDATE(error, interface.propertyHasCapability((QnnProperty_Key_t) QNN_PROPERTY_GRAPH_SUPPORT_NULL_INPUTS));

    const QnnBackend_Config_t *backend_0_config_0[] = {NULL};
    Qnn_BackendHandle_t backend_0;
    VALIDATE(error, interface.backendCreate(loghandle, backend_0_config_0, &backend_0));

    const QnnDevice_Config_t *device_0_config_0[] = {NULL};
    Qnn_DeviceHandle_t device_0;
    VALIDATE(error, interface.deviceCreate(loghandle, device_0_config_0, &device_0));

    const QnnContext_Config_t *context_0_config_0[] = {NULL};
    Qnn_ContextHandle_t context_0;
    VALIDATE(error, interface.contextCreate(backend_0, device_0, context_0_config_0, &context_0));

    const QnnGraph_Config_t *context_0_convReluModel_config_0[] = {NULL};
    Qnn_GraphHandle_t context_0_convReluModel;
    VALIDATE(error,
             interface.graphCreate(context_0, "convReluModel", context_0_convReluModel_config_0,
                                   &context_0_convReluModel));

    //how to compose qnn graph

    //step-1:
    uint32_t context_0_convReluModel_tensor_0_dims[] = {1, 299, 299, 3};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_0_quantizeParams = {
            (Qnn_Definition_t) 2147483647/*QNN_DEFINITION_UNDEFINED*/,
            (Qnn_QuantizationEncoding_t) 2147483647/*QNN_QUANTIZATION_ENCODING_UNDEFINED*/, .scaleOffsetEncoding = {0.0, 0}};
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_0_clientBuf = {NULL, 0};
    Qnn_TensorV1_t context_0_convReluModel_tensor_0_v1 = {0, "input_0",
                                                          (Qnn_TensorType_t) 0/*QNN_TENSOR_TYPE_APP_WRITE*/,
                                                          0/*QNN_TENSOR_DATA_FORMAT_FLAT_BUFFER*/,
                                                          (Qnn_DataType_t) 562/*QNN_DATATYPE_FLOAT_32*/,
                                                          context_0_convReluModel_tensor_0_quantizeParams,
                                                          4, context_0_convReluModel_tensor_0_dims,
                                                          (Qnn_TensorMemType_t) 0/*QNN_TENSORMEMTYPE_RAW*/,
                                                          context_0_convReluModel_tensor_0_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_0 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_0_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel,
                                                      &context_0_convReluModel_tensor_0));




    //step-2:
    uint32_t context_0_convReluModel_tensor_1_dims[] = {3, 3, 3, 32};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_1_quantizeParams = {
            (Qnn_Definition_t) 2147483647,
            (Qnn_QuantizationEncoding_t) 2147483647, .scaleOffsetEncoding = {0.0, 0}};
    static float context_0_convReluModel_tensor_1_data[864];
    fread(context_0_convReluModel_tensor_1_data, 4, 864, fp);
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_1_clientBuf = {
            (void *) context_0_convReluModel_tensor_1_data, 3456};
    Qnn_TensorV1_t context_0_convReluModel_tensor_1_v1 = {0,
                                                          "InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_weight",
                                                          (Qnn_TensorType_t) 4/*QNN_TENSOR_TYPE_STATIC*/,
                                                          0/*QNN_TENSOR_DATA_FORMAT_FLAT_BUFFER*/,
                                                          (Qnn_DataType_t) 562/*QNN_DATATYPE_FLOAT_32*/,
                                                          context_0_convReluModel_tensor_1_quantizeParams,
                                                          4, context_0_convReluModel_tensor_1_dims,
                                                          (Qnn_TensorMemType_t) 0,
                                                          context_0_convReluModel_tensor_1_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_1 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_1_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel,
                                                      &context_0_convReluModel_tensor_1));



    //step-3:
    uint32_t context_0_convReluModel_tensor_2_dims[] = {32};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_2_quantizeParams = {
            (Qnn_Definition_t) 2147483647,
            (Qnn_QuantizationEncoding_t) 2147483647, .scaleOffsetEncoding = {0.0, 0}};
    static float context_0_convReluModel_tensor_2_data[32];
    fread(context_0_convReluModel_tensor_2_data, 4, 32, fp);
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_2_clientBuf = {
            (void *) context_0_convReluModel_tensor_2_data, 128};
    Qnn_TensorV1_t context_0_convReluModel_tensor_2_v1 = {0,
                                                          "InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_bias",
                                                          (Qnn_TensorType_t) 4/*QNN_TENSOR_TYPE_STATIC*/,
                                                          0,
                                                          (Qnn_DataType_t) 562/*QNN_DATATYPE_FLOAT_32*/,
                                                          context_0_convReluModel_tensor_2_quantizeParams,
                                                          1, context_0_convReluModel_tensor_2_dims,
                                                          (Qnn_TensorMemType_t) 0,
                                                          context_0_convReluModel_tensor_2_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_2 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_2_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel,
                                                      &context_0_convReluModel_tensor_2));



    //step-4:
    uint32_t context_0_convReluModel_tensor_3_dims[] = {2};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_3_quantizeParams = {
            (Qnn_Definition_t) 2147483647,
            (Qnn_QuantizationEncoding_t) 2147483647, .scaleOffsetEncoding = {0.0, 0}};
    static uint32_t context_0_convReluModel_tensor_3_data[2];
    fread(context_0_convReluModel_tensor_3_data, 4, 2, fp);
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_3_clientBuf = {
            (void *) context_0_convReluModel_tensor_3_data, 8};
    Qnn_TensorV1_t context_0_convReluModel_tensor_3_v1 = {0,
                                                          "InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_dilation",
                                                          (Qnn_TensorType_t) 4/*QNN_TENSOR_TYPE_STATIC*/, 0,
                                                          (Qnn_DataType_t) 306/*QNN_DATATYPE_UINT_32*/,
                                                          context_0_convReluModel_tensor_3_quantizeParams,
                                                          1, context_0_convReluModel_tensor_3_dims,
                                                          (Qnn_TensorMemType_t) 0/*QNN_TENSORMEMTYPE_RAW*/,
                                                          context_0_convReluModel_tensor_3_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_3 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_3_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel, &context_0_convReluModel_tensor_3));




    //step-5:
    uint32_t context_0_convReluModel_tensor_4_dims[] = {2, 2};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_4_quantizeParams = {
            (Qnn_Definition_t) 2147483647,
            (Qnn_QuantizationEncoding_t) 2147483647, .scaleOffsetEncoding = {0.0, 0}};
    static uint32_t context_0_convReluModel_tensor_4_data[4];
    fread(context_0_convReluModel_tensor_4_data, 4, 4, fp);
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_4_clientBuf = {
            (void *) context_0_convReluModel_tensor_4_data, 16};
    Qnn_TensorV1_t context_0_convReluModel_tensor_4_v1 = {0,
                                                          "InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_pad_amount",
                                                          (Qnn_TensorType_t) 4/*QNN_TENSOR_TYPE_STATIC*/, 0,
                                                          (Qnn_DataType_t) 306/*QNN_DATATYPE_UINT_32*/,
                                                          context_0_convReluModel_tensor_4_quantizeParams,
                                                          2, context_0_convReluModel_tensor_4_dims,
                                                          (Qnn_TensorMemType_t) 0,
                                                          context_0_convReluModel_tensor_4_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_4 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_4_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel,
                                                      &context_0_convReluModel_tensor_4));




    //step-6:
    uint32_t context_0_convReluModel_tensor_5_dims[] = {2};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_5_quantizeParams = {
            (Qnn_Definition_t) 2147483647,
            (Qnn_QuantizationEncoding_t) 2147483647, .scaleOffsetEncoding = {0.0, 0}};
    static uint32_t context_0_convReluModel_tensor_5_data[2];
    fread(context_0_convReluModel_tensor_5_data, 4, 2, fp);
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_5_clientBuf = {
            (void *) context_0_convReluModel_tensor_5_data, 8};
    Qnn_TensorV1_t context_0_convReluModel_tensor_5_v1 = {0,
                                                          "InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_stride",
                                                          (Qnn_TensorType_t) 4/*QNN_TENSOR_TYPE_STATIC*/, 0,
                                                          (Qnn_DataType_t) 306/*QNN_DATATYPE_UINT_32*/,
                                                          context_0_convReluModel_tensor_5_quantizeParams,
                                                          1, context_0_convReluModel_tensor_5_dims,
                                                          (Qnn_TensorMemType_t) 0,
                                                          context_0_convReluModel_tensor_5_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_5 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_5_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel,
                                                      &context_0_convReluModel_tensor_5));




    //step-7:
    uint32_t context_0_convReluModel_tensor_6_dims[] = {1, 149, 149, 32};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_6_quantizeParams = {
            (Qnn_Definition_t) 2147483647,
            (Qnn_QuantizationEncoding_t) 2147483647, .scaleOffsetEncoding = {0.0, 0}};
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_6_clientBuf = {NULL, 0};
    Qnn_TensorV1_t context_0_convReluModel_tensor_6_v1 = {0,
                                                          "InceptionV3_InceptionV3_Conv2d_1a_3x3_BatchNorm_FusedBatchNorm_0",
                                                          (Qnn_TensorType_t) 3/*QNN_TENSOR_TYPE_NATIVE*/, 0,
                                                          (Qnn_DataType_t) 562/*QNN_DATATYPE_FLOAT_32*/,
                                                          context_0_convReluModel_tensor_6_quantizeParams,
                                                          4, context_0_convReluModel_tensor_6_dims,
                                                          (Qnn_TensorMemType_t) 0,
                                                          context_0_convReluModel_tensor_6_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_6 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_6_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel, &context_0_convReluModel_tensor_6));


    //step-8:
    Qnn_Param_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_0 = {
            (Qnn_ParamType_t) 1/*QNN_PARAMTYPE_TENSOR*/,
            "dilation",
            .tensorParam = context_0_convReluModel_tensor_3
    };
    Qnn_Param_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_1 = {
            (Qnn_ParamType_t) 1/*QNN_PARAMTYPE_TENSOR*/,
            "pad_amount",
            .tensorParam = context_0_convReluModel_tensor_4
    };
    Qnn_Param_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_2 = {
            (Qnn_ParamType_t) 1/*QNN_PARAMTYPE_TENSOR*/,
            "stride",
            .tensorParam = context_0_convReluModel_tensor_5
    };
    Qnn_Param_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_3 = {
            (Qnn_ParamType_t) 0/*QNN_PARAMTYPE_SCALAR*/,
            "group",
            .scalarParam = {
                    (Qnn_DataType_t) 306, .uint32Value = 1}
    };
    Qnn_Param_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_params[] = {
            context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_0,
            context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_1,
            context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_2,
            context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_param_3};

    Qnn_Tensor_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_inputs[] = {
            context_0_convReluModel_tensor_0,
            context_0_convReluModel_tensor_1,
            context_0_convReluModel_tensor_2};

    Qnn_Tensor_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_outputs[] = {
            context_0_convReluModel_tensor_6
    };

    Qnn_OpConfig_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0 = {
            (Qnn_OpConfigVersion_t) 1,
            .v1 = {
                    "InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D",
                    "qti.aisw",
                    "Conv2d",
                    4,
                    context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_params,
                    3,
                    context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_inputs,
                    1,
                    context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0_outputs
            }
    };
    VALIDATE(error, interface.backendValidateOpConfig(backend_0, context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0));
    VALIDATE(error, interface.graphAddNode(context_0_convReluModel, context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Conv2D_0));




    //step-9:
    uint32_t context_0_convReluModel_tensor_7_dims[] = {1, 149, 149, 32};
    Qnn_QuantizeParams_t context_0_convReluModel_tensor_7_quantizeParams = {
            (Qnn_Definition_t) 2147483647,
            (Qnn_QuantizationEncoding_t) 2147483647, .scaleOffsetEncoding = {0.0, 0}};
    Qnn_ClientBuffer_t context_0_convReluModel_tensor_7_clientBuf = {NULL, 0};
    Qnn_TensorV1_t context_0_convReluModel_tensor_7_v1 = {0,
                                                          "InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0",
                                                          (Qnn_TensorType_t) 1/*QNN_TENSOR_TYPE_APP_READ*/, 0,
                                                          (Qnn_DataType_t) 562/*QNN_DATATYPE_FLOAT_32*/,
                                                          context_0_convReluModel_tensor_7_quantizeParams,
                                                          4, context_0_convReluModel_tensor_7_dims,
                                                          (Qnn_TensorMemType_t) 0,
                                                          context_0_convReluModel_tensor_7_clientBuf};
    Qnn_Tensor_t context_0_convReluModel_tensor_7 = {
            (Qnn_TensorVersion_t) 1, .v1 = context_0_convReluModel_tensor_7_v1};
    VALIDATE(error, interface.tensorCreateGraphTensor(context_0_convReluModel,
                                                      &context_0_convReluModel_tensor_7));



    //step-10:
    Qnn_Param_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0_params[] = {};
    Qnn_Tensor_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0_inputs[] = {
            context_0_convReluModel_tensor_6
    };
    Qnn_Tensor_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0_outputs[] = {
            context_0_convReluModel_tensor_7
    };
    Qnn_OpConfig_t context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0 = {
            (Qnn_OpConfigVersion_t) 1, .v1 = {
                    "InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu",
                    "qti.aisw",
                    "Relu",
                    0,
                    context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0_params,
                    1,
                    context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0_inputs,
                    1,
                    context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0_outputs
            }
    };
    VALIDATE(error, interface.backendValidateOpConfig(backend_0,context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0));
    VALIDATE(error, interface.graphAddNode(context_0_convReluModel,context_0_convReluModel_InceptionV3_InceptionV3_Conv2d_1a_3x3_Relu_0));

    //step-11:
    VALIDATE(error, interface.graphFinalize(context_0_convReluModel, NULL, NULL));

    Qnn_Tensor_t context_0_convReluModel_inputTensors_0[] = {context_0_convReluModel_tensor_0};
    Qnn_Tensor_t context_0_convReluModel_outputTensors_0[] = {context_0_convReluModel_tensor_7};
    VALIDATE(error,interface.graphExecute(context_0_convReluModel, context_0_convReluModel_inputTensors_0,
                                    1, context_0_convReluModel_outputTensors_0, 1, NULL, NULL));


    VALIDATE(error, interface.contextFree(context_0, NULL));

    VALIDATE(error, interface.deviceFree(device_0));

    VALIDATE(error, interface.backendFree(backend_0));

    if (logging) {
        VALIDATE(error, interface.logFree(loghandle));
    }

    if (fclose(fp)) error = -1;

    LOGGI("leave %s", __func__);
    GGML_JNI_NOTIFY("leave %s", __func__);
    return error == 0 || error == QNN_COMMON_ERROR_NOT_SUPPORTED ? 0 : error;
}

I have found that there are several technical paths to utilizing the Qualcomm Hexagon NPU in ggml-qnn via the QNN SDK:

  1. the general approach used by many ggml backends (such as ggml-sycl, ggml-cann, ggml-opencl, ...); this is what the current implementation of the ggml-qnn backend in the source file ggml-qnn.cpp follows:

Image

Pros: this approach can benefit greatly from the excellent "backend scheduler" feature of the ggml backend subsystem, and it can serve as a "functional implementation" or a good starting point in the upstream llama.cpp community. Accordingly, this approach can be verified easily with my self-made script build-run-android.sh.

Cons: there might be performance concerns for the ggml-qnn backend.

  2. mapping the ggml computational graph to a QNN computational graph:

Image

Pros: this approach might be equivalent to the principle shown in the code quoted above, and I guess that is the secret of how to utilize the Hexagon NPU maximally in a QNN backend. I do not know why there is such a big difference between ggml-qnn and ggml-sycl/ggml-cann/ggml-opencl.

Cons: it cannot take advantage of the backend scheduler feature, and the workload is much larger (a sketch of this approach appears after this list).

There are many undocumented (or not very clearly documented) technical details in the QNN SDK, so I think the necessary technical support has to come from Qualcomm's technical team, even if I reach the final goal via the first approach with help from the great llama.cpp community.

  3. the dedicated technology provided by Qualcomm (even though the QNN SDK is also provided by Qualcomm), such as QNNSample or XiaoMiStableDiffusionOnDevice; I think this approach does not make sense for a ggml-qnn implementation.

  4. the dedicated tech stack (https://github.com/quic/aimet) provided by Qualcomm; I also think this approach does not make sense for a ggml-qnn implementation.

Corrections from domain technical experts are greatly welcomed and appreciated.
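
Referring back to approach 2 above (mapping the ggml compute graph to a QNN compute graph), here is a hedged sketch of what it amounts to in code. It is illustrative only: qnn_emit_node() is a hypothetical helper (it would create the QNN tensors for node->src[] and the node itself via tensorCreateGraphTensor() and then call graphAddNode()), and the exact way a backend iterates a ggml_cgraph depends on the ggml version. The point is that the whole ggml graph is translated into a single QNN graph, finalized once, and then executed repeatedly.

/* Hedged sketch of approach 2 (one QNN graph for the whole ggml graph); not code from this PR. */
#include "ggml.h"
#include "ggml-impl.h"      /* struct ggml_cgraph definition; location depends on ggml version */
#include "QnnInterface.h"

extern int qnn_emit_node(QNN_INTERFACE_VER_TYPE iface, Qnn_GraphHandle_t graph,
                         const struct ggml_tensor * node);   /* hypothetical helper */

static int qnn_build_graph_from_cgraph(QNN_INTERFACE_VER_TYPE iface, Qnn_ContextHandle_t ctx,
                                       struct ggml_cgraph * cgraph, Qnn_GraphHandle_t * out_graph) {
    const QnnGraph_Config_t * cfg[] = { NULL };
    if (iface.graphCreate(ctx, "ggml_cgraph", cfg, out_graph) != 0) {
        return -1;
    }
    /* field access below assumes the classic cgraph layout; newer ggml versions
     * expose ggml_graph_n_nodes()/ggml_graph_node() accessors instead */
    for (int i = 0; i < cgraph->n_nodes; i++) {
        struct ggml_tensor * node = cgraph->nodes[i];
        if (qnn_emit_node(iface, *out_graph, node) != 0) {
            return -1;   /* unsupported op: let the backend scheduler fall back to CPU */
        }
    }
    /* finalize ("compile") the whole graph once; it can then be executed many times */
    return iface.graphFinalize(*out_graph, NULL, NULL) == 0 ? 0 : -1;
}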

@oreomaker commented Feb 25, 2025

How do you handle the QNN graph build-execute-free cycle during inference? As we are also integrating QNN into our framework, graph building is time-consuming and memory usage grows hugely when finalizing the QNN graph. QNN seems to have no easy way to free a graph during execution.
Maybe a better execution pipeline is needed for it.

@chraac commented Feb 25, 2025

How do you handle the QNN graph build-execute-free during inference? As we are also integrating the QNN in our framework, the graph building is time consuming and the memory increases hugely when finalizing the QNN graph. It seems that the QNN has no easy way to free a graph during execution. Maybe a better execution pipeline is needed for it.

Hi @oreomaker ,

Nice question! This is actually the key point for similar QNN backend implementations. QNN's interface functions more like a traditional ML framework that requires "compilation" (what they call "finalize") before execution.
To reduce compilation time, this PR utilizes a mechanism called "graph cache" to store each operation graph with its specific tensor configuration:
image

In my fork, I've taken this a step further by generating a QNN graph based on the GGML graph. This approach allows the QNN framework to perform more comprehensive optimizations during compilation.

image

@zhouwg (Contributor, Author) commented Feb 25, 2025

How do you handle the QNN graph build-execute-free during inference? As we are also integrating the QNN in our framework, the graph building is time consuming and the memory increases hugely when finalizing the QNN graph. It seems that the QNN has no easy way to free a graph during execution. Maybe a better execution pipeline is needed for it.

Hi @oreomaker ,

Nice question! This is actually the key point for similar QNN backend implementations. QNN's interface functions more like a traditional ML framework that requires "compilation" (what they call "finalize") before execution. To reduce compilation time, this PR utilizes a mechanism called "graph cache" to store each operation graph with its specific tensor configuration: image

In my fork, I've taken this a step further by generating a QNN graph based on the GGML graph. This approach allows the QNN framework to perform more comprehensive optimizations during compilation.

image

I am a little curious whether you are a regular employee of Qualcomm's Shenzhen branch. As I have said many times before, you can submit your own standalone PR and I would personally like to see your success in this community, but please do not keep bringing intentionally challenging comments into my PR:

  • what you did in my first PR caused me to be blocked from this community from 07/19/2024 to 02/16/2025.
  • what you did in my second PR resulted in that PR being closed by the maintainers, because two Chinese programmers dropped too many pointless arguments in this community.

Thanks so much!

By the way, I personally do not think you are a regular Qualcomm employee, because your behavior breaks many unwritten rules, and Qualcomm's top-talent employees do not behave that way.

@chraac commented Feb 25, 2025

How do you handle the QNN graph build-execute-free during inference? As we are also integrating the QNN in our framework, the graph building is time consuming and the memory increases hugely when finalizing the QNN graph. It seems that the QNN has no easy way to free a graph during execution. Maybe a better execution pipeline is needed for it.

Hi @oreomaker ,
Nice question! This is actually the key point for similar QNN backend implementations. QNN's interface functions more like a traditional ML framework that requires "compilation" (what they call "finalize") before execution. To reduce compilation time, this PR utilizes a mechanism called "graph cache" to store each operation graph with its specific tensor configuration: image
In my fork, I've taken this a step further by generating a QNN graph based on the GGML graph. This approach allows the QNN framework to perform more comprehensive optimizations during compilation.
image

I'm a little curious of that whether you are a regular employee from Qualcomm Shenzhen branch. as I said before many times, you can submit your standalone PR and I personally like to see your success in this community, but pls don't brings some non-technical comments again and again in my PR:

  • what you did in my first PR got me blocked from this community
  • what you did in my second PR resulted in my PR being closed by the maintainers because two Chinese programmers dropped two much pointless arguments in this community and I think this is another joke from China.
    thanks so much!

I don't want to continue debating this here. The reason for restricting your access to this repo was your inappropriate comments on unrelated PRs (here, here2 and here3), and the repo owner gave a clear reason for it.

I'd suggest focusing on improving your codebase in an objective manner without making assumptions about or judging others' work.

If my comments made you uncomfortable, I apologize. I'm happy to step back from this discussion. I can also create my own PR where anyone interested can discuss the design approach more constructively.

@zhouwg (Contributor, Author) commented Feb 25, 2025

As a very old programmer, and as I said before: I have no intention of getting involved in a meaningless competition between me and you or your team, and I would like to see your success in this community. What you did looks like manipulation (first offering an unacceptable kind of "help", then provoking the other person, then using someone else's hand to punish them, and finally achieving your goal). I do not understand why you spent effort studying my comments in this community; I already admitted my mistake last year, here.

@zhouwg (Contributor, Author) commented Feb 25, 2025

How do you handle the QNN graph build-execute-free during inference? As we are also integrating the QNN in our framework, the graph building is time consuming and the memory increases hugely when finalizing the QNN graph. It seems that the QNN has no easy way to free a graph during execution. Maybe a better execution pipeline is needed for it.

Thanks for your comment, although I see you are a code reviewer on a similar PR from @chraac. First of all, I wish you success in this community.

This is a good question, and your concern is correct:

  • the existing execution pipeline comes from ggml's default CPU backend and other hardware-accelerated backends such as ggml-sycl, ggml-cann, and ggml-opencl
  • I personally think the well-designed QNN SDK manages its internal hardware and software resources very carefully, which may be why it does not provide cleanup APIs such as releaseGraph/releaseTensor/...
  • the graph cache mechanism was already used in my PoC of ggml-qnn and in my first PR here in 04/2024: https://github.com/kantv-ai/kantv/blob/ggml-qnn-quantize/core/ggml/llamacpp/ggml-qnn.cpp#L2091. Your professional help would be valuable for this PR.
  • a better execution pipeline is needed because of performance concerns; please see my simple tech doc: mapping ggml compute graph to QNN compute graph. Your professional help would be valuable for this PR.

[Updated on 02/26/2025] My previous answer might be wrong: the first technical approach can actually work very well (quantized mulmat had not been implemented yet when I wrote that simple tech doc), and there is a 10x-13x performance improvement in my local dev environment with the QNN backend.

By the way, you can refer to my personal understanding of ggml-qnn and other ggml backends in that simple tech doc:

prons: this approach might be equivalent to the principle shown in the above quoted code, and I guess that's the secret of how to utilize the Hexagon NPU maximally in QNN backend. I don't know why there is such big difference between ggml-qnn and ggml-sycl/ggml-cann/ggml-opencl.

You will understand what I said there if you spend some time studying Huawei's CANN or Intel's SYCL. By the way, I have a strong multimedia background (FFmpeg, OpenMAX IL / MediaCodec, and other hardware-acceleration software stacks), although I know little about hard-core AI technology.

@zhouwg (Contributor, Author) commented Feb 26, 2025

No description provided.

Labels: ggml (changes relating to the ggml tensor library for machine learning), script (Script related), testing (Everything test related)
Projects: none yet
3 participants